COVID-19 Data Analysis¶
This project explores COVID-19 confirmed cases, deaths, and recoveries across countries and WHO regions. We use descriptive statistics and data visualization to identify trends and patterns.
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
The dataset contains country-level COVID-19 data for 187 countries and regions.
data = pd.read_csv('OneDrive/NSDCprojects/Capsule_COVID_Data_Visualization/country_wise_latest.csv')
data.head()
| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.50 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| 1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.00 | Europe |
| 2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| 3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.60 | Europe |
| 4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |
As a first look, we examine whether the number of deaths is related to the number of recovered cases.
print(data[['Deaths','Recovered']].corr())
sns.scatterplot(x='Deaths', y='Recovered', data=data)
plt.show()
| | Deaths | Recovered |
|---|---|---|
| Deaths | 1.000000 | 0.832098 |
| Recovered | 0.832098 | 1.000000 |
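To make the correlation coefficient less of a black box, the following minimal sketch computes Pearson's r by hand and compares it with pandas. It uses only the five rows shown in the `data.head()` output, so the value it prints is illustrative and will differ from the full-dataset figure; the variable names (`sample`, `r_manual`, `r_pandas`) are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the data.head() output.
sample = pd.DataFrame({
    "Deaths": [1269, 144, 1163, 52, 41],
    "Recovered": [25198, 2745, 18837, 803, 242],
})

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample (n - 1) statistics.
x = sample["Deaths"].to_numpy(dtype=float)
y = sample["Recovered"].to_numpy(dtype=float)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# pandas computes the same quantity; Series.corr uses Pearson by default.
r_pandas = sample["Deaths"].corr(sample["Recovered"])
print(r_manual, r_pandas)
```

A high r on raw counts is partly a size effect: countries with large outbreaks tend to have both more deaths and more recoveries.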
Summary statistics for the dataset:
data.describe()
| | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.870000e+02 | 187.000000 | 1.870000e+02 | 1.870000e+02 | 187.000000 | 187.000000 | 187.000000 | 187.000000 | 187.000000 | 187.00 | 1.870000e+02 | 187.000000 | 187.000000 |
| mean | 8.813094e+04 | 3497.518717 | 5.063148e+04 | 3.400194e+04 | 1222.957219 | 28.957219 | 933.812834 | 3.019519 | 64.820535 | inf | 7.868248e+04 | 9448.459893 | 13.606203 |
| std | 3.833187e+05 | 14100.002482 | 1.901882e+05 | 2.133262e+05 | 5710.374790 | 120.037173 | 4197.719635 | 3.454302 | 26.287694 | NaN | 3.382737e+05 | 47491.127684 | 24.509838 |
| min | 1.000000e+01 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.000000e+01 | -47.000000 | -3.840000 |
| 25% | 1.114000e+03 | 18.500000 | 6.265000e+02 | 1.415000e+02 | 4.000000 | 0.000000 | 0.000000 | 0.945000 | 48.770000 | 1.45 | 1.051500e+03 | 49.000000 | 2.775000 |
| 50% | 5.059000e+03 | 108.000000 | 2.815000e+03 | 1.600000e+03 | 49.000000 | 1.000000 | 22.000000 | 2.150000 | 71.320000 | 3.62 | 5.020000e+03 | 432.000000 | 6.890000 |
| 75% | 4.046050e+04 | 734.000000 | 2.260600e+04 | 9.149000e+03 | 419.500000 | 6.000000 | 221.000000 | 3.875000 | 86.885000 | 6.44 | 3.708050e+04 | 3172.000000 | 16.855000 |
| max | 4.290259e+06 | 148011.000000 | 1.846641e+06 | 2.816444e+06 | 56336.000000 | 1076.000000 | 33728.000000 | 28.560000 | 100.000000 | inf | 3.834677e+06 | 455582.000000 | 226.320000 |
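The `inf` mean and `NaN` standard deviation in the `Deaths / 100 Recovered` column presumably come from rows where `Recovered` is 0, which makes the ratio a division by zero. The sketch below reproduces the effect on three rows taken from the tables in this notebook and shows one possible way to handle it, replacing infinite values with `NaN` so that pandas skips them. The variable names are illustrative, and whether to drop or keep such rows is a judgment call.

```python
import numpy as np
import pandas as pd

# Three rows copied from the notebook's tables; Canada reports 0 recovered.
sample = pd.DataFrame({
    "Country/Region": ["Brazil", "Canada", "Chile"],
    "Deaths": [87618, 8944, 9187],
    "Recovered": [1846641, 0, 319954],
})

# Dividing by zero yields inf in pandas, which then poisons the mean.
ratio = sample["Deaths"] / sample["Recovered"] * 100

# Replace +/-inf with NaN; pandas aggregations skip NaN by default.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print("mean with inf:   ", ratio.mean())
print("mean without inf:", cleaned.mean())
```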
Next, we compare totals across WHO regions to choose a region as the focus of the first part of our data visualization.
# numeric_only=True keeps text columns such as Country/Region out of the sum
who_region_data = data.groupby('WHO Region').sum(numeric_only=True).reset_index()
# Plot the trends for confirmed cases, deaths, and recoveries across WHO regions
fig, axes = plt.subplots(3, 1, figsize=(8,12))
sns.barplot(ax=axes[0], x='WHO Region', y='Confirmed', data=who_region_data)
axes[0].set_title('Confirmed Cases by WHO Region')
axes[0].set_ylabel('Confirmed Count Across WHO Regions')
sns.barplot(ax=axes[1], x='WHO Region', y='Deaths', data=who_region_data)
axes[1].set_title('Deaths by WHO Region')
axes[1].set_ylabel('Death Count Across WHO Regions')
sns.barplot(ax=axes[2], x='WHO Region', y='Recovered', data=who_region_data)
axes[2].set_title('Recoveries by WHO Region')
axes[2].set_ylabel('Recovered Count Across WHO Regions')
plt.tight_layout()
plt.show()
Because the Americas had the most confirmed cases, deaths, and recoveries among all WHO regions, we will first focus our analysis on cases in the Americas.
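The ranking can also be checked numerically instead of being read off the bars. The following sketch assumes a small hand-built sample (six rows copied from tables in this notebook, two per region) rather than the full CSV, so it only demonstrates the technique: sum the count columns per region, then use `idxmax` to find the leading region for each column.

```python
import pandas as pd

# Six rows copied from the notebook's tables; NOT the full dataset.
sample = pd.DataFrame({
    "Country/Region": ["US", "Brazil", "India", "Bangladesh", "Albania", "Andorra"],
    "WHO Region": ["Americas", "Americas", "South-East Asia",
                   "South-East Asia", "Europe", "Europe"],
    "Confirmed": [4290259, 2442375, 1480073, 226225, 4880, 907],
    "Deaths": [148011, 87618, 33408, 2965, 144, 52],
    "Recovered": [1325804, 1846641, 951166, 125683, 2745, 803],
})

# Sum only the count columns so ratios and text columns are left out.
count_columns = ["Confirmed", "Deaths", "Recovered"]
totals = sample.groupby("WHO Region")[count_columns].sum()

# idxmax returns, for each column, the region with the largest total.
leaders = totals.idxmax()
print(leaders)
```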
Data Analysis of Cases in the Americas¶
d_america = data[data['WHO Region'] == 'Americas'].sort_values(by='Confirmed', ascending = False)
new_data = d_america.head(10) #top 10 countries
new_data
| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173 | US | 4290259 | 148011 | 1325804 | 2816444 | 56336 | 1076 | 27941 | 3.45 | 30.90 | 11.16 | 3834677 | 455582 | 11.88 | Americas |
| 23 | Brazil | 2442375 | 87618 | 1846641 | 508116 | 23284 | 614 | 33728 | 3.59 | 75.61 | 4.74 | 2118646 | 323729 | 15.28 | Americas |
| 111 | Mexico | 395489 | 44022 | 303810 | 47657 | 4973 | 342 | 8588 | 11.13 | 76.82 | 14.49 | 349396 | 46093 | 13.19 | Americas |
| 132 | Peru | 389717 | 18418 | 272547 | 98752 | 13756 | 575 | 4697 | 4.73 | 69.93 | 6.76 | 357681 | 32036 | 8.96 | Americas |
| 35 | Chile | 347923 | 9187 | 319954 | 18782 | 2133 | 75 | 1859 | 2.64 | 91.96 | 2.87 | 333029 | 14894 | 4.47 | Americas |
| 37 | Colombia | 257101 | 8777 | 131161 | 117163 | 16306 | 508 | 11494 | 3.41 | 51.02 | 6.69 | 204005 | 53096 | 26.03 | Americas |
| 6 | Argentina | 167416 | 3059 | 72575 | 91782 | 4890 | 120 | 2057 | 1.83 | 43.35 | 4.21 | 130774 | 36642 | 28.02 | Americas |
| 32 | Canada | 116458 | 8944 | 0 | 107514 | 682 | 11 | 0 | 7.68 | 0.00 | inf | 112925 | 3533 | 3.13 | Americas |
| 51 | Ecuador | 81161 | 5532 | 34896 | 40733 | 467 | 17 | 0 | 6.82 | 43.00 | 15.85 | 74620 | 6541 | 8.77 | Americas |
| 20 | Bolivia | 71181 | 2647 | 21478 | 47056 | 1752 | 64 | 309 | 3.72 | 30.17 | 12.32 | 60991 | 10190 | 16.71 | Americas |
new_data.describe()
| | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.000000e+01 | 10.000000 | 1.000000e+01 | 1.000000e+01 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.0000 | 1.000000e+01 | 10.000000 | 10.000000 |
| mean | 8.559080e+05 | 33621.500000 | 4.328866e+05 | 3.893999e+05 | 12457.900000 | 340.200000 | 9067.300000 | 4.900000 | 51.276000 | inf | 7.576744e+05 | 98233.600000 | 13.644000 |
| std | 1.398190e+06 | 48126.771919 | 6.313155e+05 | 8.643672e+05 | 17243.211981 | 350.807893 | 12164.741273 | 2.832749 | 27.626579 | NaN | 1.242506e+06 | 157603.809848 | 8.272305 |
| min | 7.118100e+04 | 2647.000000 | 0.000000e+00 | 1.878200e+04 | 467.000000 | 11.000000 | 0.000000 | 1.830000 | 0.000000 | 2.8700 | 6.099100e+04 | 3533.000000 | 3.130000 |
| 25% | 1.291975e+05 | 6343.250000 | 4.431575e+04 | 4.720625e+04 | 1847.250000 | 66.750000 | 696.500000 | 3.420000 | 33.925000 | 5.2275 | 1.173872e+05 | 11366.000000 | 8.817500 |
| 50% | 3.025120e+05 | 9065.500000 | 2.018540e+05 | 9.526700e+04 | 4931.500000 | 231.000000 | 3377.000000 | 3.655000 | 47.185000 | 8.9600 | 2.685170e+05 | 34339.000000 | 12.535000 |
| 75% | 3.940460e+05 | 37621.000000 | 3.159180e+05 | 1.147508e+05 | 15668.500000 | 558.250000 | 10767.500000 | 6.297500 | 74.190000 | 13.9475 | 3.556098e+05 | 51345.250000 | 16.352500 |
| max | 4.290259e+06 | 148011.000000 | 1.846641e+06 | 2.816444e+06 | 56336.000000 | 1076.000000 | 33728.000000 | 11.130000 | 91.960000 | inf | 3.834677e+06 | 455582.000000 | 28.020000 |
The first plot is a bar plot describing the active cases among countries in the Americas.
active = new_data[['Country/Region','Active']].sort_values(by='Active',ascending=False)
sns.barplot(y=active.get('Country/Region'),x=active.get('Active'))
<Axes: xlabel='Active', ylabel='Country/Region'>
Insights:
- Among countries in the Americas, the US had far more active cases than any other country: more than five times as many as Brazil, which had the second-highest count. This suggests that, measured by active cases, the US was the hardest-hit country in the Americas.
The second chart is a grouped bar chart comparing confirmed and recovered COVID cases across countries.
plt.figure(figsize=(10, 5))
X_axis = np.arange(len(new_data['Confirmed']))
plt.bar(X_axis - 0.2,new_data['Confirmed'] , 0.4, label = 'Confirmed')
plt.bar(X_axis + 0.2, new_data['Recovered'], 0.4, label = 'Recovered')
plt.xticks(X_axis,new_data['Country/Region'] )
plt.xlabel("Country/Region")  # the x-axis lists countries
plt.ylabel("Number of cases")  # bar height is the case count
plt.title("Comparing the Confirmed and Recovered Counts in the Americas")
plt.legend()
plt.show()
Insights:
- Among countries in the Americas, the US and Brazil had the most confirmed and recovered patients.
- While the US had more confirmed cases, Brazil had more recovered patients, which may suggest that Brazil had better control over the pandemic than the US.
- Countries with fewer cases, such as Mexico and Chile, had recovered counts close to their confirmed counts, which suggests these countries had relatively effective control over the pandemic.
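Because the countries differ so much in outbreak size, the comparison is easier to read as a share than as raw counts. A minimal sketch, using four rows copied from the top-ten table, recomputes the recovered share of confirmed cases; it should agree with the dataset's own `Recovered / 100 Cases` column. The column name `Recovered share (%)` is introduced here for illustration.

```python
import pandas as pd

# Confirmed and recovered counts copied from the top-ten table.
sample = pd.DataFrame({
    "Country/Region": ["US", "Brazil", "Mexico", "Chile"],
    "Confirmed": [4290259, 2442375, 395489, 347923],
    "Recovered": [1325804, 1846641, 303810, 319954],
}).set_index("Country/Region")

# Share of confirmed cases that are recorded as recovered, in percent.
sample["Recovered share (%)"] = sample["Recovered"] / sample["Confirmed"] * 100
print(sample["Recovered share (%)"].round(2))
```

Note that a low share can reflect an outbreak that is still growing or incomplete recovery reporting, not only weaker control.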
The third chart is a pie chart showing each country's share of recovered cases.
plt.figure(figsize=(8,8))
patches, text, autotexts = plt.pie(new_data['Recovered'], labels = new_data['Country/Region'],autopct="%0.2f%%", pctdistance=0.8)
plt.title("Distribution of Recovered COVID Cases")
plt.axis('equal')
plt.legend(patches,new_data['Country/Region'] )
plt.show()
Insights:
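- Brazil had the largest number of recovered COVID cases among the ten countries, which may reflect effective control over the pandemic.
- Canada shows zero recovered cases despite more than 100,000 confirmed cases, which may point to gaps in recovery reporting as well as a need for improvements in its control strategies.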
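A zero in `Recovered` next to a large `Confirmed` count may mean that recoveries were not reported, not that nobody recovered. One possible check, sketched below on three rows copied from this notebook's tables, is to list such rows before interpreting them; the variable name `suspect` is illustrative.

```python
import pandas as pd

# Rows copied from the notebook's tables.
sample = pd.DataFrame({
    "Country/Region": ["Canada", "Ecuador", "Timor-Leste"],
    "Confirmed": [116458, 81161, 24],
    "Recovered": [0, 34896, 0],
})

# Flag countries with confirmed cases but no recorded recoveries.
suspect = sample[(sample["Recovered"] == 0) & (sample["Confirmed"] > 0)]
print(suspect["Country/Region"].tolist())
```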
Next, we are going to compare Active cases in the Americas using a donut chart.
x = new_data['Active'].to_list()
labels = new_data['Country/Region']
colors = ['#0F52BA', '#4169E1', '#0096FF',
          '#87CEEB', '#89CFF0', '#7DF9FF']
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
plt.pie(x, colors=colors, labels=labels,
        autopct='%1.1f%%', pctdistance=0.85,
        explode=explode)
centre_circle = plt.Circle((0, 0), 0.65, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Active Cases in the Americas')
plt.show()
Insights:
- The US had by far the most active cases, accounting for over 70% of the active cases among the ten countries shown. This indicates that, by active case count, the US outbreak was the largest in the Americas.
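The percentages on the chart can be reproduced directly from the `Active` column of the top-ten table. The sketch below is a small check of the US share and the US-to-Brazil ratio; the names `share` and `us_to_brazil` are illustrative, and the figures cover only the ten countries in the table, not every country in the WHO region.

```python
import pandas as pd

# Active case counts copied from the top-ten table for the Americas.
active = pd.Series({
    "US": 2816444, "Brazil": 508116, "Mexico": 47657, "Peru": 98752,
    "Chile": 18782, "Colombia": 117163, "Argentina": 91782,
    "Canada": 107514, "Ecuador": 40733, "Bolivia": 47056,
}, name="Active")

# Each country's share of the ten-country total, in percent.
share = active / active.sum() * 100

# How many times larger the US count is than Brazil's.
us_to_brazil = active["US"] / active["Brazil"]

print(share.round(1).sort_values(ascending=False))
print(round(us_to_brazil, 2))
```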
We will then visualize the number of confirmed cases in the Americas with a choropleth map.
fig = px.choropleth(new_data,
                    locations='Country/Region',
                    locationmode='country names',
                    color='Confirmed',
                    color_continuous_scale='Reds',
                    hover_name='Country/Region',
                    title='Total Confirmed Cases in the Americas')
fig.show()
Insights:
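- Among the ten countries shown, the North American countries (US, Mexico, Canada) together had more confirmed COVID cases than the South American countries, driven largely by the US.
- Although the US had the largest outbreak, neighboring Canada reported far fewer confirmed cases.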
Finally, in order to see the influence of the pandemic, we will calculate mortality and recovery rates, aggregate the data by WHO region, and then plot these rates using a barplot.
data['Mortality Rate'] = (data['Deaths'] / data['Confirmed']) * 100
data['Recovery Rate'] = (data['Recovered'] / data['Confirmed']) * 100
# Aggregate data by WHO Region
numeric_columns = ['Confirmed', 'Deaths', 'Recovered', 'Active', 'Mortality Rate', 'Recovery Rate']
who_region_rates = data.groupby('WHO Region')[numeric_columns].mean().reset_index()
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
sns.barplot(ax=axes[0], x='WHO Region', y='Mortality Rate', data=who_region_rates)
axes[0].set_title('Mortality Rate by WHO Region')
axes[0].set_ylabel('Mortality Rate (%)')
# Second panel: recovery rate (the first panel already shows mortality rate)
sns.barplot(ax=axes[1], x='WHO Region', y='Recovery Rate', data=who_region_rates)
axes[1].set_title('Recovery Rate by WHO Region')
axes[1].set_ylabel('Recovery Rate (%)')
plt.tight_layout()
plt.show()
Insights:
- Europe had the highest average mortality rate, which may point to the most severe pandemic conditions among the regions.
- South-East Asia and the Western Pacific had the lowest average mortality rates, which may reflect effective control during COVID-19.
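One caveat when reading regional rates: taking `.mean()` of per-country rates weights every country equally, regardless of how many cases it had. An alternative, shown here only as a sketch and not as the method used above, is a pooled rate that divides the region's total deaths by its total confirmed cases. The three rows are copied from the South-East Asia table in this notebook and were picked because they make the difference easy to see; the names `mean_of_rates` and `pooled_rate` are illustrative.

```python
import pandas as pd

# Three South-East Asia rows copied from the notebook's tables.
sample = pd.DataFrame({
    "Country/Region": ["Timor-Leste", "Bhutan", "India"],
    "Confirmed": [24, 99, 1480073],
    "Deaths": [0, 0, 33408],
})

# Unweighted: average the per-country rates (each country counts equally).
country_rates = sample["Deaths"] / sample["Confirmed"] * 100
mean_of_rates = country_rates.mean()

# Pooled: total deaths over total confirmed (large outbreaks dominate).
pooled_rate = sample["Deaths"].sum() / sample["Confirmed"].sum() * 100

print(round(mean_of_rates, 3), round(pooled_rate, 3))
```

Neither number is wrong; they answer different questions (the typical country versus the region as a whole).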
Data Analysis for Cases in Southeast Asia¶
Southeast Asia was chosen as the focus region because it is geographically close to China, where COVID-19 was first reported. Due to this proximity, countries in Southeast Asia were exposed to the virus relatively early, making the region important for understanding how the outbreak spread and how different countries were affected.
asia_data = data[data.get('WHO Region') == 'South-East Asia']
asia_data
| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region | Mortality Rate | Recovery Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Bangladesh | 226225 | 2965 | 125683 | 97577 | 2772 | 37 | 1801 | 1.31 | 55.56 | 2.36 | 207453 | 18772 | 9.05 | South-East Asia | 1.310642 | 55.556636 |
| 19 | Bhutan | 99 | 0 | 86 | 13 | 4 | 0 | 1 | 0.00 | 86.87 | 0.00 | 90 | 9 | 10.00 | South-East Asia | 0.000000 | 86.868687 |
| 27 | Burma | 350 | 6 | 292 | 52 | 0 | 0 | 2 | 1.71 | 83.43 | 2.05 | 341 | 9 | 2.64 | South-East Asia | 1.714286 | 83.428571 |
| 79 | India | 1480073 | 33408 | 951166 | 495499 | 44457 | 637 | 33598 | 2.26 | 64.26 | 3.51 | 1155338 | 324735 | 28.11 | South-East Asia | 2.257186 | 64.264803 |
| 80 | Indonesia | 100303 | 4838 | 58173 | 37292 | 1525 | 57 | 1518 | 4.82 | 58.00 | 8.32 | 88214 | 12089 | 13.70 | South-East Asia | 4.823385 | 57.997268 |
| 106 | Maldives | 3369 | 15 | 2547 | 807 | 67 | 0 | 19 | 0.45 | 75.60 | 0.59 | 2999 | 370 | 12.34 | South-East Asia | 0.445236 | 75.601069 |
| 119 | Nepal | 18752 | 48 | 13754 | 4950 | 139 | 3 | 626 | 0.26 | 73.35 | 0.35 | 17844 | 908 | 5.09 | South-East Asia | 0.255973 | 73.346843 |
| 158 | Sri Lanka | 2805 | 11 | 2121 | 673 | 23 | 0 | 15 | 0.39 | 75.61 | 0.52 | 2730 | 75 | 2.75 | South-East Asia | 0.392157 | 75.614973 |
| 167 | Thailand | 3297 | 58 | 3111 | 128 | 6 | 0 | 2 | 1.76 | 94.36 | 1.86 | 3250 | 47 | 1.45 | South-East Asia | 1.759175 | 94.358508 |
| 168 | Timor-Leste | 24 | 0 | 0 | 24 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 24 | 0 | 0.00 | South-East Asia | 0.000000 | 0.000000 |
The following scatter plot shows the relationship between the 1 week % increase and the recovery rate. The color of each point encodes the number of confirmed cases; all points are drawn at the same size.
sc = plt.scatter(asia_data.get('1 week % increase'), asia_data.get('Recovery Rate'), c = asia_data.get('Confirmed'), cmap = 'cividis')
plt.colorbar(sc, label='Confirmed cases')  # legend for the color encoding
plt.xlabel('1 Week % Increase')
plt.ylabel('Recovery Rate')
plt.title('Recovery Rate vs. 1 Week % Increase')
for i in range(len(asia_data)):
    x = asia_data.get('1 week % increase').iloc[i]
    y = asia_data.get('Recovery Rate').iloc[i]
    label = asia_data.get('Country/Region').iloc[i]
    plt.annotate(
        label,
        (x, y),
        textcoords="offset points",
        xytext=(5, 5),
        ha='center',
        fontsize=8
    )
plt.show()
Insights:
- Overall, there is no strong linear relationship between the two variables. Countries with a low weekly increase can still have either high or low recovery rates. For example, Thailand, Bhutan, and Sri Lanka have low 1-week increases but relatively high recovery rates, suggesting that slower case growth may be associated with better recovery outcomes in some countries.
- On the other hand, India stands out as an outlier. It has the highest 1-week percentage increase, while its recovery rate is only moderate compared to other countries. This may indicate that rapid case growth can put pressure on healthcare systems, slowing recovery.
- Timor-Leste is another notable outlier, with both a very low increase and an extremely low recovery rate. This could reflect limited healthcare capacity or delays in reporting recoveries.
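The statement about the lack of a strong linear relationship can be quantified with a correlation coefficient. The sketch below uses the ten rows from the South-East Asia table and computes Pearson's r twice, once with all countries and once without Timor-Leste, to see how much a single outlier moves the result. With only ten points, either value should be read as a rough indication; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the South-East Asia table in this notebook.
sea = pd.DataFrame({
    "Country/Region": ["Bangladesh", "Bhutan", "Burma", "India", "Indonesia",
                       "Maldives", "Nepal", "Sri Lanka", "Thailand", "Timor-Leste"],
    "1 week % increase": [9.05, 10.00, 2.64, 28.11, 13.70,
                          12.34, 5.09, 2.75, 1.45, 0.00],
    "Recovery Rate": [55.56, 86.87, 83.43, 64.26, 58.00,
                      75.60, 73.35, 75.61, 94.36, 0.00],
})

# Pearson correlation on all ten countries.
r_all = sea["1 week % increase"].corr(sea["Recovery Rate"])

# Same calculation with the Timor-Leste outlier removed.
trimmed = sea[sea["Country/Region"] != "Timor-Leste"]
r_trimmed = trimmed["1 week % increase"].corr(trimmed["Recovery Rate"])

print(f"all countries:       r = {r_all:.2f}")
print(f"without Timor-Leste: r = {r_trimmed:.2f}")
```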
The second visualization for Southeast Asia is a choropleth map of the mortality rate.
fig = px.choropleth(
    asia_data,
    locations='Country/Region',
    locationmode='country names',
    color='Mortality Rate',
    hover_name='Country/Region',
    color_continuous_scale='Blues',
    title='COVID-19 Mortality Rate in South-East Asia'
)
fig.show()
Insights:
- From the heatmap, countries in Southeast Asia show noticeable variation in mortality rates. Some countries are shaded darker blue, indicating higher mortality rates, while others remain much lighter. This suggests that even within the same region, outcomes differed significantly.
- Also, countries closer to China are shaded lighter, which suggests that geographic proximity alone does not determine mortality rate. Factors such as healthcare capacity, population density, government response, and reporting practices likely played a role in these differences.
The third visualization for the Southeast Asia region is a pie chart showing the share of active cases in the top five countries, with the remaining countries grouped as "Others".
top5 = asia_data.sort_values('Active', ascending=False).head(5)
others = asia_data['Active'].sum() - top5['Active'].sum()
labels = list(top5['Country/Region']) + ['Others']
sizes = list(top5['Active']) + [others]
plt.figure(figsize=(10, 10))
wedges, texts, autotexts = plt.pie(
    sizes,
    labels=labels,
    autopct='%1.1f%%',
    startangle=140,
    colors = plt.cm.YlOrBr(np.linspace(0.4, 1, len(sizes))),
    wedgeprops={'edgecolor': 'white', 'linewidth': 1},
    pctdistance=0.5
)
for text in texts:
    text.set_fontsize(10)
for text in autotexts:
    text.set_fontsize(10)
    text.set_fontweight('bold')
plt.title('Distribution of Active COVID-19 Cases in Asia (Top 5 Countries)')
plt.axis('equal')
plt.show()
Insights:
- A small number of countries account for a large proportion of active cases, while the remaining countries contribute a much smaller share.
- According to the pie chart, India had the most active cases, yet its mortality rate was not the highest in the region, which may suggest relatively effective control over the pandemic.